-
Notifications
You must be signed in to change notification settings - Fork 165
Cache and print devices for debugging future outages #2141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
changes and printing the full list. example: periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
easier to unit test.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cemakd The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe the testing that has been done for this change?
pkg/linkcache/devices_linux.go
Outdated
return newDeviceCacheForNode(period, node), nil | ||
} | ||
|
||
func TestDeviceCache(period time.Duration, node *v1.Node) *DeviceCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the test functions to the _test.go file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were not real test function but rather creating a DeviceCache for testing purposes used by other packages. I've renamed them to NewTestDeviceCache
to distinguish them from actual test functions.
pkg/linkcache/devices_linux.go
Outdated
// of the string (after the last "/") and call AddVolume for that | ||
for _, volume := range node.Status.VolumesInUse { | ||
klog.Infof("Adding volume %s to cache", string(volume)) | ||
vID, err := pvNameFromVolumeID(string(volume)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this ignore volume types that are not PD/hyperdisk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not ignoring them, I've pushed a commit that checks if the volume is for the driver pd.csi.storage.gke.io
. IIUC that should be sufficient to do the filtering.
pkg/linkcache/devices_linux.go
Outdated
// Evaluate the symlink | ||
realPath, err := filepath.EvalSymlinks(device.symlink) | ||
if err != nil { | ||
klog.Warningf("Error evaluating symlink for volume %s: %v", volumeID, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful of logging this since this is called repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the polling period is 10 mins, would it be okay to log this? Otherwise it might be hard to figure out any problems with realpath evaluation?
I'm not sure if that's a realistic problem to log, I can remove this warning log if you think it's not necessary
cmd/gce-pd-csi-driver/main.go
Outdated
@@ -275,6 +276,13 @@ func handle() { | |||
klog.Fatalf("Failed to get node info from API server: %v", err.Error()) | |||
} | |||
|
|||
deviceCache, err := linkcache.NewDeviceCacheForNode(ctx, 1*time.Minute, *nodeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the time period configurable so that we can adjust it if it ends up spamming too many logs? In fact I would probably start the logging at 10 mins initially.
In the future, if we want to turn this into a metric, I would be more comfortable reducing the polling period but removing logs to avoid log spam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added a flag disk-cache-sync-period
with a default value of 10 minutes when unspecified.
pkg/linkcache/devices_linux.go
Outdated
} | ||
|
||
// Look at the dir for a symlink that matches the pvName | ||
symlink := filepath.Join(d.dir, "google-"+deviceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been tested with nvme disks as well? It would be better to reuse functions in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/pkg/deviceutils/device-utils.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done specific testing for nvme disks but I've pushed a new commit that utilizes device-utils.go for evaluating the symlinks for a volumeID
I haven't done any additional testing on this PR since I took it over from Julian. As far as I know he was struggling to get the e2e tests passing, which I fixed with a few nil pointers checks. I'll do some more testing on this and get back to you |
Change volumes map to use symlink as key and add volumeID and real path to the deviceMappings object, this is because there can be multiple symlinks per volume Filter out caching and logging info on non pd/hyperdisk volumes Failure to get real path at add volume time is no longer an error the real path may be unavailable at this time Fix the device_windows.go NewDeviceCacheForNode parameters
61ea81b
to
61aeffd
Compare
@cemakd: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This is a copy of PR: #2097
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR adds a cache that periodically (configured to every minute currently) looks at the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache of the symlink and the real path it points to.
This will help with debugging future filesystem issues. In a past OMG, we found that our insight into changes in symlinks for specific disks hampered our ability to debug. Logging marked the real path of the disk at mount and unmount, but the change in between couldn't be detected.
This PR will print those links every minute, also logging when elements of the cache change.
An example:
The cache will also note if a symlink is broken.
NOTE: Currently this filters out any thing in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: